High availability management for a hierarchy of resources in an SDDC

ABSTRACT

Some embodiments provide a hierarchical data service (HDS) that manages many resource clusters that are in a resource cluster hierarchy. In some embodiments, each resource cluster has its own cluster manager, and the cluster managers are in a cluster manager hierarchy that mimics the hierarchy of the resource clusters. In some embodiments, both the resource cluster hierarchy and the cluster manager hierarchy are tree structures, e.g., a directed acyclic graph (DAG) structure that has one root node with multiple other nodes in a hierarchy, with each other node having only one parent node and one or more possible child nodes.

BACKGROUND

In recent years, systems that manage software defined datacenters (SDDC) have provided greater controls for managing a larger number of resources in the datacenters. These systems allow compute, network, and service resources to be managed often through a single user interface. Moreover, the complexity of managing the SDDC resources has increased with the advent of multi-cloud operations, as more resources now have to be controlled across multiple clouds, which may be different from one another.

BRIEF SUMMARY

Some embodiments provide a hierarchical data service (HDS) that manages many resource clusters that are in a resource cluster hierarchy. In some embodiments, the HDS is a multi-cloud data service (MCDS) that manages several resource clusters in two or more public or private clouds. The resources in some of these embodiments include compute resources such as datacenters, host computers, machines (e.g., virtual machines, Pods, containers, etc. executing on host computers), standalone servers, processors of host computers, processor cores of processors, graphical processing units, co-processors, memories of host computers and/or storages. Conjunctively, or alternatively, the resources include network elements such as gateways, routers, switches, middlebox service machines and appliances, etc.

In some embodiments, each resource cluster has its own cluster manager, and the cluster managers are in a cluster manager hierarchy that mimics the hierarchy of the resource clusters. In some embodiments, both the resource cluster hierarchy and the cluster manager hierarchy are tree structures, e.g., a directed acyclic graph (DAG) structure that has one root node with multiple other nodes in a hierarchy, with each other node having only one parent node and one or more possible child nodes. As further described below, other embodiments use other hierarchical structures, e.g., ones allowing a child cluster manager to have multiple parent cluster managers.

Each cluster manager (CM) in some embodiments connects to an upstream cluster manager to send state up to ancestor clusters (e.g., parent clusters, grandparent clusters, great grandparent clusters, etc.) and to receive desired state (e.g., instructions) from ancestor clusters. This architecture is very flexible and scalable in that it allows more resource clusters and cluster managers to be added horizontally or vertically.

In order to ensure that the management hierarchy does not get overwhelmed with updates from progeny clusters (e.g., child clusters, grandchild clusters, great grandchild clusters, etc.), some embodiments employ novel processes for ancestor clusters to receive states from progeny clusters. These embodiments also employ novel processes for distributing desired state requests from ancestor clusters to progeny clusters. Some embodiments further employ novel high availability (HA) architectures to ensure that the hierarchical management system does not completely fail when one or more cluster managers fail. These processes and architectures allow the cluster management hierarchy (and in turn the resource cluster hierarchy) to scale very easily, and have reasonable failure semantics.

More specifically, scalability requires the management hierarchy to impose some limits on the amount of data sent to the upstream cluster managers, and on the data coming down to make changes to the desired state. The management hierarchy in some embodiments limits the information sent upstream by specifying how many levels a cluster sends up exact information. For the levels that are past a maximum upstream propagation level L from a particular cluster's level, the management hierarchy only sends up a summary, e.g., enough to allow the upper levels to manage some aspects of the no longer visible clusters, but limited so as not to overwhelm the system. This results in any cluster having only a clear view of a few layers, and some data about the rest of the system that is hidden. Like a fractal system, the management hierarchy of some embodiments allows an administrator to zoom in to any cluster, see a few levels, and then zoom into one of those to see more information. To support this zoom-in feature, the hierarchical managers at the top level (e.g., at the root node) or at each of the lower levels can direct lower level managers to provide additional information on an as-needed basis.

In some embodiments, all the clusters have the same maximum upstream propagation level L, while in other embodiments the maximum upstream propagation level L can be defined for each cluster independently of other clusters. In still other embodiments, all clusters at the same level of the cluster hierarchy have the same maximum upstream propagation level L.

In order to limit the amount of data sent up, the management hierarchy in some embodiments sums up the data from the cluster managers below level L, adds those values to the cluster manager at level L, and then reports the aggregated data upstream. For example, if cluster manager X is at level L and has 10 cluster managers reporting to it, each of which is responsible for clusters with 12 servers with 24 cores each, the cluster manager X in some embodiments adds 120 servers and 2880 cores to the data of the cluster managed by the cluster manager X before reporting cluster X's value upstream to its parent manager (i.e., the manager of the cluster manager X). In essence, the management hierarchy treats the cluster managed by the cluster manager X as if it contains all the clusters reporting to it.

While the above-described approach limits the amount of data sent up, there is still a risk that too many updates need to be sent upstream, as in this example adding any server to any of the 10 clusters requires an update. The management hierarchy in some embodiments addresses this issue by requiring cluster managers to report changes to the upstream cluster's manager only when the data change is significant (e.g., greater than a threshold value). For example, the management hierarchy in some embodiments specifies that a data update is only sent up when the data has changed by more than 1% from the last time that it was reported. For the above-described example, the management hierarchy would only send an update to the number of cores when at least 29 cores have been added or removed from the clusters reporting to cluster X.

A change in desired state can be sent down easily to all cluster managers that are visible from an ancestor cluster manager (e.g., from the top cluster manager). However, given that not all progeny cluster managers are visible to an ancestor cluster manager (e.g., to the top cluster manager), the management hierarchy of some embodiments uses novel processes to manage top-down desired state distribution in a scalable manner. For instance, in some embodiments, desired state can be further distributed with uniform commands to all the progeny cluster managers in the hierarchy, e.g., with commands such as “upgrade all Object Stores to Version 3” or “make sure any object store has at least 30% free capacity,” which might prompt some lower-level manager to move objects across various clusters to balance the system.

Also, some embodiments employ requests with criteria that allow the progeny cluster managers to make decisions as to how to implement the requests. For instance, in some embodiments, an ancestor cluster manager can send a request to all its progeny cluster managers to find an optimal placement for a single instance of a resource, e.g., “find a cluster that has 5PB free storage, and 40 servers with GPU.” For such a request, each progeny cluster manager sends the request to any cluster manager downstream such that any downstream cluster manager that has enough space reports up with a number that defines how “good” a fit that request would be and possibly how much these resources would cost.

Each cluster manager that gets such a report from a downstream cluster manager discards the report of the downstream cluster manager when it already has a better one. On the other hand, each particular progeny cluster manager sends up a report from a downstream cluster manager when the report is better than the particular progeny cluster manager's own report (if any) and other reports provided by downstream cluster managers of the particular progeny cluster manager. The top cluster manager in some embodiments accepts the placement identified in the first report that it receives, or the best report that it receives after a certain duration of time, or the best report that it receives after receiving responses from all of its direct child cluster managers (i.e., all the direct child cluster managers of the top cluster manager).
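The report-filtering behavior described above can be illustrated with a minimal sketch. The class names, the fields free_pb and gpu_servers, and the scoring rule (a tighter fit scores higher) are hypothetical choices made only for illustration; the description does not specify how the "goodness" number is computed. The sketch models only the variant in which a manager waits for the reports of all of its direct child cluster managers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PlacementReport:
    cluster: str
    score: float  # hypothetical convention: higher means a better fit


@dataclass
class Manager:
    name: str
    free_pb: float        # free storage in PB (illustrative field)
    gpu_servers: int      # servers with GPUs (illustrative field)
    children: List["Manager"] = field(default_factory=list)

    def own_report(self, need_pb: float, need_gpu: int) -> Optional[PlacementReport]:
        """Report only when this manager's own cluster satisfies the request."""
        if self.free_pb >= need_pb and self.gpu_servers >= need_gpu:
            # Assumed scoring rule: less leftover capacity is a tighter, better fit.
            leftover = (self.free_pb - need_pb) + (self.gpu_servers - need_gpu)
            return PlacementReport(self.name, -leftover)
        return None

    def best_report(self, need_pb: float, need_gpu: int) -> Optional[PlacementReport]:
        """Keep the best report seen so far and discard any worse downstream report."""
        best = self.own_report(need_pb, need_gpu)
        for child in self.children:
            report = child.best_report(need_pb, need_gpu)
            if report is not None and (best is None or report.score > best.score):
                best = report
        return best


# Illustrative hierarchy: only clusters "a" and "c" can satisfy the request.
c = Manager("c", free_pb=5, gpu_servers=40)
b = Manager("b", free_pb=1, gpu_servers=0, children=[c])
a = Manager("a", free_pb=6, gpu_servers=50)
root = Manager("root", free_pb=1, gpu_servers=0, children=[a, b])

chosen = root.best_report(need_pb=5, need_gpu=40)
no_fit = root.best_report(need_pb=100, need_gpu=1)
```

Under the assumed scoring rule, cluster "c" is an exact fit and therefore wins over cluster "a"; each intermediate manager forwards at most one report upstream.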

In some embodiments, cluster managers that get the state change need to decide how to translate it for cluster managers reporting in. Consider the following example of a desired state change: “start as few as possible database servers of type X to collect detailed stats for all clusters.” This desired state change is then delegated from an ancestor cluster manager (e.g., the root node cluster manager) down to all its progeny cluster managers in order to delegate the “placement decision making” down from the root cluster manager to its downstream cluster managers.

The management hierarchy of some embodiments works well with policies and templates that can be pushed down to ensure that all cluster managers have a uniform list of policies, or pushed up so that the top cluster manager knows which policies are supported by the downstream cluster managers.

When cluster managers are in a DAG structure, and a particular cluster manager fails or a connection between the particular cluster manager and its child manager fails, its ancestor cluster managers no longer have visibility into the progeny cluster managers of the particular cluster manager. To address such cluster manager failures (to create a high availability management hierarchy), different embodiments employ different techniques. For instance, in some embodiments, each cluster manager has a list of possible upstream cluster managers so that when the cluster manager's parent CM fails the cluster manager can identify another upstream cluster manager on the list and connect to the identified upstream manager as its new parent cluster manager. In order to keep all data correct, some of these embodiments require that any cluster that loses contact with a downstream cluster immediately remove the data it reports up in order to prevent the data from getting reported twice to upstream cluster managers.
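The failover technique described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class names, the is_reachable callback, and the use of a single server count as the reported data are all hypothetical and are not taken from the description.

```python
from typing import Callable, Dict, Iterable, Optional


class ChildManager:
    """A non-root cluster manager holding an ordered list of candidate parents."""

    def __init__(self, name: str, upstream_candidates: Iterable[str]):
        self.name = name
        self.upstream_candidates = list(upstream_candidates)
        self.parent: Optional[str] = None

    def connect(self, is_reachable: Callable[[str], bool]) -> Optional[str]:
        """Pick the first reachable candidate as the (new) parent."""
        for candidate in self.upstream_candidates:
            if is_reachable(candidate):
                self.parent = candidate
                return candidate
        self.parent = None
        return None


class ParentManager:
    """A parent that drops a lost child's data so it is not reported twice."""

    def __init__(self) -> None:
        self.child_data: Dict[str, int] = {}

    def receive(self, child: str, servers: int) -> None:
        self.child_data[child] = servers

    def on_child_lost(self, child: str) -> None:
        # Remove the lost child's contribution from what is reported up.
        self.child_data.pop(child, None)

    def total_servers(self) -> int:
        return sum(self.child_data.values())


# Illustrative scenario: the connection to parent "cm-a" fails.
parent_a, parent_b = ParentManager(), ParentManager()
parent_a.receive("cm-x", 12)

alive = {"cm-b", "cm-c"}
child = ChildManager("cm-x", ["cm-a", "cm-b", "cm-c"])
new_parent = child.connect(lambda name: name in alive)

parent_a.on_child_lost("cm-x")
parent_b.receive("cm-x", 12)
combined_view = parent_a.total_servers() + parent_b.total_servers()
```

Because the old parent removes the lost child's data before the new parent starts reporting it, the 12 servers in this scenario are counted once rather than twice.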

Other embodiments allow a cluster manager to connect to more than one upstream cluster manager and split the resources in some way between those cluster managers. For instance, each cluster manager reports about X% (e.g., 25%) of its data to each of N (e.g., 100/X) upstream cluster managers. If one upstream CM fails or the connection between this CM and its child CM fails, only a fraction of the resources will be temporarily invisible to the upstream cluster managers. This HA approach is combined with the above-described HA approach in some embodiments to allow a child CM to connect to another parent CM when its prior parent CM, or the connection to its prior parent CM, fails.
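As a rough illustration of the split-reporting approach, the sketch below divides a resource count evenly among N upstream cluster managers. The function name split_report, the even split, and the figure of 100 servers are assumptions chosen for illustration; the description only states that the resources are split "in some way".

```python
from typing import List


def split_report(total: int, n_upstreams: int) -> List[int]:
    """Split a resource count into n nearly equal shares that sum to total."""
    base, remainder = divmod(total, n_upstreams)
    return [base + (1 if i < remainder else 0) for i in range(n_upstreams)]


# With X = 25%, N = 100 / X = 4 upstream cluster managers.
shares = split_report(100, 4)

# If the first upstream manager (or the connection to it) fails, only its
# share becomes temporarily invisible to the rest of the hierarchy.
visible_after_failure = sum(shares[1:])

uneven = split_report(10, 4)
```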

Some embodiments also employ novel processes to avoid loops in the management hierarchy. When any cluster manager connects to an upstream cluster manager that in turn, possibly over several connections, connects to itself, a loop is formed in the management hierarchy. Some or all of the cluster managers in the hierarchy in some embodiments are configured to detect such loops by detecting that the data that they collect increases without bounds.
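One way such a check could look is sketched below. The description does not say how unbounded growth is recognized, so the heuristic here (strictly increasing totals over the last few collection rounds, ending above a plausibility ceiling) and the names grows_without_bound, plausible_max, and min_samples are purely illustrative assumptions.

```python
from typing import Sequence


def grows_without_bound(totals: Sequence[int], plausible_max: int,
                        min_samples: int = 3) -> bool:
    """Heuristic loop check on the totals collected in successive rounds.

    Flags a suspected loop when the last min_samples totals are strictly
    increasing and the most recent total exceeds plausible_max.
    """
    if len(totals) < min_samples:
        return False
    recent = totals[-min_samples:]
    increasing = all(later > earlier for earlier, later in zip(recent, recent[1:]))
    return increasing and recent[-1] > plausible_max


# In a loop, a manager's own data keeps returning to it and is re-added.
loop_suspected = grows_without_bound([120, 240, 480, 960], plausible_max=500)
normal_operation = grows_without_bound([120, 121, 120, 122], plausible_max=500)
```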

The preceding Summary is intended to serve as a brief introduction to some embodiments of the invention. It is not meant to be an introduction or overview of all inventive subject matter disclosed in this document. The Detailed Description that follows and the Drawings that are referred to in the Detailed Description will further describe the embodiments described in the Summary as well as other embodiments. Accordingly, to understand all the embodiments described by this document, a full review of the Summary, Detailed Description, the Drawings and the Claims is needed. Moreover, the claimed subject matters are not to be limited by the illustrative details in the Summary, Detailed Description, and Drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features of the invention are set forth in the appended claims. However, for purposes of explanation, several embodiments of the invention are set forth in the following figures.

FIG. 1 conceptually illustrates a DAG that is a hierarchy of resource clusters.

FIG. 2 conceptually illustrates a DAG that is a hierarchy of cluster managers for the resource clusters of the DAG illustrated in FIG. 1.

FIG. 3 illustrates a process performed by each particular cluster manager to aggregate the data that it receives from its progeny cluster managers and to forward this data to its parent cluster manager.

FIG. 4 illustrates an example in which different cluster managers can have different maximum levels even with cluster managers at the same level of the management hierarchy.

FIG. 5 illustrates a process for displaying data collected by the cluster managers.

FIG. 6 illustrates the resource cluster hierarchy for the cluster manager hierarchy illustrated in FIG. 4.

FIG. 7 illustrates the administrator performing a zoom-in operation on a resource cluster after seeing the data collected in FIG. 6.

FIG. 8 illustrates a desired state distributed as a uniform command to all the progeny cluster managers in the hierarchy.

FIGS. 9-11 illustrate another way to distribute desired state.

FIG. 12 illustrates a process that each non-root cluster manager performs in some embodiments to be connected at all times to one parent cluster manager.

FIG. 13 illustrates a process that a parent cluster manager performs in some embodiments to manage its relationships with its child cluster managers.

FIG. 14 illustrates an embodiment in which each cluster manager reports X% of its data to each of N upstream cluster managers.

FIG. 15 illustrates examples of cluster managers and resource clusters in multiple datacenters.

FIG. 16 conceptually illustrates a computer system with which some embodiments of the invention are implemented.

DETAILED DESCRIPTION

In the following detailed description of the invention, numerous details, examples, and embodiments of the invention are set forth and described. However, it will be clear and apparent to one skilled in the art that the invention is not limited to the embodiments set forth and that the invention may be practiced without some of the specific details and examples discussed.

Some embodiments provide a hierarchical data service (HDS) that manages many resource clusters that are in a resource cluster hierarchy. In some embodiments, each resource cluster has its own cluster manager (CM), and the cluster managers are in a cluster manager hierarchy that mimics the hierarchy of the resource clusters. In some embodiments, both the resource cluster hierarchy and the cluster manager hierarchy are tree structures, e.g., a directed acyclic graph (DAG) structure that has one root node with multiple other nodes in a hierarchy, with each other node having only one parent node and one or more possible child nodes.

FIGS. 1 and 2 illustrate two DAGs 100 and 200, with the first DAG 100 being a hierarchy of resource clusters and the second DAG 200 being a hierarchy of cluster managers for the resource clusters of the first DAG 100. In some embodiments, the HDS is a multi-cloud data service (MCDS) that manages several resource clusters in two or more public or private clouds. The resources in some of these embodiments include compute resources such as datacenters, host computers, machines (e.g., virtual machines, Pods, containers, etc. executing on host computers), standalone servers, processors of host computers, processor cores of processors, graphical processing units, co-processors, memories of host computers and/or storages.

Conjunctively, or alternatively, the resources in some embodiments include network elements such as gateways, routers, switches, middlebox service machines and appliances, etc. In yet other embodiments, the resources include other elements in datacenters and computer networks. Also, some embodiments are used to manage just one type of resource (e.g., storage or compute) at multiple levels of hierarchy in one or more datacenters.

The DAG 100 includes multiple levels of resource clusters, while the DAG 200 includes multiple levels of cluster managers with each cluster manager corresponding to a resource cluster in the DAG 100. In FIGS. 1 and 2, three levels are explicitly illustrated but these DAGs can include many more levels, e.g., tens of levels, etc. Each DAG 100 or 200 has one root node 102 and 202, respectively, and multiple other nodes, with the root node having multiple child nodes and no parent node, and each other node in the DAG having only one parent node and one or more possible child nodes.

In DAG 200 of FIG. 2, each cluster manager is a group of one or more machines (VMs, Pods, containers, etc.) or standalone servers that manages a resource cluster in the DAG 100 of FIG. 1. Also, in DAG 200, each cluster manager at most has only one parent cluster manager and can have one or more child managers. In this structure, each cluster manager connects at most to one upstream cluster manager to send state up to ancestor cluster managers (e.g., parent cluster managers, grandparent cluster managers, great grandparent cluster managers, etc.) and to receive desired state (e.g., instructions) from ancestor cluster managers.

Also, in this structure, the root cluster manager 202 has no parent cluster manager, and connects to no upstream cluster managers. As further described below, other embodiments use other hierarchical structures, e.g., ones allowing a child cluster manager to have multiple parent cluster managers. In addition, other embodiments also have one cluster manager manage multiple resource clusters at the same or different levels of the resource cluster hierarchy 100.

The structure of the DAG 100 or 200 is very flexible and scalable in that it allows more resource clusters or cluster managers to be added horizontally or vertically, and therefore is an ideal approach to address the complexity problem of allowing users to manage arbitrarily large and diverse systems across many clouds and clusters. In order to ensure that the management hierarchy does not get overwhelmed with updates from progeny cluster managers (e.g., child cluster managers, grandchild cluster managers, great grandchild cluster managers, etc.), some embodiments employ novel processes for ancestor cluster managers to receive states from progeny cluster managers. These embodiments also employ novel processes for distributing desired state requests from ancestor clusters to progeny clusters. Some embodiments further employ novel high availability (HA) processes and/or architectures to ensure that the hierarchical management system does not completely fail when one or more cluster managers fail. These processes and architectures allow the cluster management hierarchy (and in turn the resource cluster hierarchy) to scale very easily, and have good failure semantics.

Scalability requires the management hierarchy to impose some limits on the amount of data sent to the upstream cluster managers, and on the data and/or instructions coming down to make changes to the desired state. The management hierarchy in some embodiments limits the information sent upstream by specifying how many levels a cluster manager sends up exact information. For the levels that are past a maximum upstream-propagation level L from a particular cluster manager's level, the management hierarchy only sends up a summary, e.g., enough to allow the upper levels to manage some aspects of the no longer visible clusters, but limited so as not to overwhelm the system. In some embodiments, all the cluster managers have the same maximum level L, while in other embodiments the maximum level L can be defined for each cluster manager independently of other cluster managers. In still other embodiments, all cluster managers at the same level of the cluster hierarchy have the same maximum level L.

In order to limit the amount of data sent up, the management hierarchy in some embodiments sums up the data from the cluster managers below level L, adds those values to the data produced by the cluster manager at level L, and then reports the aggregated data upstream. For example, if cluster manager X is at level L and has 10 cluster managers reporting to it, each of which is responsible for clusters with 12 servers with 24 cores each, the cluster manager X in some embodiments adds 120 servers and 2880 cores to the data of the cluster managed by the cluster manager X before reporting cluster X's value upstream to the parent manager of the cluster manager X. In essence, the management hierarchy treats the cluster managed by the cluster manager X as if it contains all the clusters of all of its child cluster managers.
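The arithmetic of this example can be checked with a short sketch. The ClusterReport type and the roll_up function are hypothetical names, and the cluster of manager X is assumed to hold no servers of its own so that the totals match the 120 servers and 2880 cores given above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterReport:
    name: str
    servers: int
    cores: int


def roll_up(own: ClusterReport, children: List[ClusterReport]) -> ClusterReport:
    """Add the children's totals to the level-L manager's own data."""
    return ClusterReport(
        name=own.name,
        servers=own.servers + sum(c.servers for c in children),
        cores=own.cores + sum(c.cores for c in children),
    )


# Ten reporting managers, each with 12 servers of 24 cores each.
children = [ClusterReport(f"child-{i}", servers=12, cores=12 * 24) for i in range(10)]

# Assumption for illustration: cluster X itself contributes no servers.
cluster_x = ClusterReport("X", servers=0, cores=0)

summary = roll_up(cluster_x, children)
```

Only the single summary record for cluster X is sent to its parent manager, rather than the ten individual child records.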

Imposing maximum level(s) for propagating exact data upstream in the management hierarchy results in any cluster having only a clear view of a few layers, and some data about the rest of the system that is hidden. Like a fractal system, the management hierarchy of some embodiments allows an administrator to zoom in to any cluster, see a few levels, and then zoom into one of those to see more information. To support this zoom-in feature, the hierarchical managers at the top level (e.g., at the root cluster manager) or at each of the lower levels can direct lower level managers to provide additional information on an as-needed basis.

While the above-described approach limits the amount of data sent up, there is still a risk that too many updates need to be sent upstream. For instance, in the above-described example, adding any server to any of the 10 clusters could require an update. The management hierarchy in some embodiments addresses this issue by requiring cluster managers to report changes to the upstream cluster's manager only when the data change is significant (e.g., greater than a threshold value). For example, the management hierarchy in some embodiments specifies that a data update is only sent up when the data has changed by more than 1% from the last time that it was reported.

FIG. 3 illustrates a process 300 performed by each particular cluster manager to aggregate the data that it receives from its progeny cluster managers and to forward this data to its parent cluster manager. In some embodiments, the particular cluster manager performs the process 300 continuously to report up-to-date data from its progeny when there is a sufficient amount of new data to report (e.g., the amount of new data exceeds a threshold level).

As shown, the process 300 receives (at 305) data from one or more of its progeny cluster managers. In some embodiments, the received data can be any kind of data, such as configuration data, operational state data, metric data, etc. In some embodiments, each cluster manager receives data only from its child cluster managers, which, in turn, receive data from their child cluster managers, and so on.

At 310, the process 300 identifies any of its progeny cluster managers that have reached their maximum upstream propagation level (MUPL). As mentioned above, in some embodiments all the cluster managers have the same maximum level L, while in other embodiments the maximum level L can be defined for each cluster manager independently of other cluster managers. In still other embodiments, all cluster managers at the same level of the cluster management hierarchy have the same maximum level L. Also, in some embodiments, the MUPL for a cluster can change over time, e.g., as more clusters are added, the MUPL for some or all of the clusters can be decreased in some embodiments in order to avoid data overflow.

FIG. 4 illustrates an example in which different cluster managers can have different maximum levels even with cluster managers at the same level of the management hierarchy. Specifically, this figure illustrates a cluster manager hierarchy 400 that has most of the cluster managers with a maximum level of 5, a few cluster managers with a maximum level of 4, and one cluster manager 405 with an MUPL of 3. In this example, cluster manager 405 is a great grandchild of the cluster manager 410. Accordingly, the cluster manager 410 identifies the cluster manager 405 as having reached its maximum upstream-propagation level (of 3) once its data has reached the cluster manager 410. As all of its other progeny cluster managers have a maximum level of 4 or higher, the cluster manager 410 does not identify any other cluster manager as having reached its maximum upstream-propagation level.
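The identification performed at 310 can be sketched as a walk over the progeny of a cluster manager that compares each progeny's distance from that manager with the progeny's MUPL. The dictionary layout and the function name progeny_at_limit are assumptions, and the node labeled "mid" is a hypothetical stand-in for the intermediate cluster manager between 410 and 415, which this example does not number.

```python
from typing import Dict, List


def progeny_at_limit(observer: str,
                     children: Dict[str, List[str]],
                     mupl: Dict[str, int]) -> List[str]:
    """Return progeny whose data has travelled at least MUPL levels upstream
    by the time it reaches the observer."""
    hits = []
    stack = [(child, 1) for child in children.get(observer, [])]
    while stack:
        node, distance = stack.pop()
        if distance >= mupl[node]:
            hits.append(node)
        stack.extend((c, distance + 1) for c in children.get(node, []))
    return sorted(hits)


# Fragment of the hierarchy: 405 is a great grandchild of 410, and 415 is
# the parent of both 405 and 422.
children = {
    "410": ["mid"],
    "mid": ["415"],
    "415": ["405", "422"],
}
mupl = {"mid": 5, "415": 5, "405": 3, "422": 4}

at_limit_for_410 = progeny_at_limit("410", children, mupl)
at_limit_for_415 = progeny_at_limit("415", children, mupl)
```

From the vantage point of cluster manager 410, only cluster manager 405 is at its limit, while its parent 415 sees both 405 and 422 below their limits.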

For each progeny cluster manager identified at 310, the process 300 (at 315) combines the data reported by the identified progeny cluster manager with the data reported by the cluster manager that is the parent of the identified progeny cluster manager. For instance, in the example of FIG. 4, the cluster manager 410 combines the data reported by the cluster manager 405 with the data reported by its parent cluster manager 415. The data from the cluster manager 415 does not need to be combined with the data of its other child cluster 422 because this child cluster 422 has not reached its maximum upstream-propagation level (of 4) when its data reaches the cluster manager 410.

As mentioned above, each cluster manager in some embodiments only reports data upstream when it has collected sufficient data since its last upstream data report (e.g., it has collected more than a threshold amount of new data since its last upstream data report). Hence, at 320, the process 300 determines whether it has collected a sufficient amount of data since its last upstream report. In the above-described example, when the management hierarchy in some embodiments specifies that a data update is only sent up when the data has changed by more than 1% from the last time that it was reported, the process 300 would only send an update to the number of cores when at least 29 cores have been added or removed from the clusters reporting to cluster X.
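The 29-core figure follows from the 1% threshold applied to the 2880 cores last reported: 1% of 2880 is 28.8, so the first whole-core change that exceeds the threshold is 29. A minimal sketch of the determination at 320 is shown below; the function names are hypothetical, and treating any change from a zero baseline as reportable is an assumption.

```python
import math


def should_report(last_reported: float, current: float,
                  threshold: float = 0.01) -> bool:
    """True when the value changed by more than threshold since the last report."""
    if last_reported == 0:
        # Assumption: any change from a zero baseline is worth reporting.
        return current != 0
    return abs(current - last_reported) / last_reported > threshold


def min_change_to_report(last_reported: int, threshold: float = 0.01) -> int:
    """Smallest whole-unit change that exceeds the relative threshold."""
    return math.floor(last_reported * threshold) + 1


cores_last_reported = 2880
smallest_change = min_change_to_report(cores_last_reported)

adds_28 = should_report(cores_last_reported, cores_last_reported + 28)
adds_29 = should_report(cores_last_reported, cores_last_reported + 29)
removes_29 = should_report(cores_last_reported, cores_last_reported - 29)
```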

When the process 300 determines (at 320) that it has not collected a sufficient amount of data since its last upstream report, the process returns to 305 to receive more data from its progeny cluster managers (e.g., from its child cluster managers). On the other hand, when the process 300 determines (at 320) that it has collected a sufficient amount of data since its last upstream report, it sends (at 325) the new data that it has collected for its own resource cluster and the new data that it has collected from its progeny cluster managers to the parent cluster manager of the particular cluster manager that is executing process 300. After 325, the process 300 returns to 305 to receive additional data from the progeny cluster managers.

In sending its upstream report, the process 300 generates, for each identified progeny cluster manager that has reached its maximum upstream-propagation level, a combined report that combines the identified progeny cluster manager's data with the data of its parent cluster manager. For instance, in the example illustrated in FIG. 4, the process 300 generates a combined report that combines the data reported by the cluster manager 405 with the data of its parent cluster manager 415. In the example illustrated in FIG. 4, the cluster manager 410 does not combine the new data of any other progeny cluster managers as it does not identify any other progeny cluster managers that, at the level of the cluster manager 410, have reached their maximum upstream-propagation level.

Different embodiments generate different types of combined reports. Some embodiments simply add the data of the identified progeny cluster managers that have reached their maximum upstream-propagation level to the data of their parent cluster managers (e.g., add the data of the cluster manager 405 to the data of its parent cluster manager 415) so that to the upstream cluster managers (i.e., to the ancestor cluster managers 420 and 425 of the cluster manager 410) the data appears to be the data of the parent cluster managers (i.e., of the parent cluster manager 415). In other embodiments, the process 300 reports the data of the parent cluster manager (e.g., manager 415) without the aggregation but appends to the parent cluster manager's data a separate data structure that summarizes the data of the progeny cluster manager(s) of the parent cluster manager (e.g., cluster manager 415).
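The two styles of combined report can be contrasted with a short sketch. The dictionary layout, the progeny_summary key, and the server and core counts assigned to cluster managers 415 and 405 are illustrative assumptions; the description does not define the format of either report.

```python
from typing import Dict, List

METRICS = ("servers", "cores")


def merged_report(parent: Dict, progeny: List[Dict]) -> Dict:
    """First style: fold the progeny data into the parent's own numbers."""
    out = dict(parent)
    for metric in METRICS:
        out[metric] = parent[metric] + sum(p[metric] for p in progeny)
    return out


def appended_summary_report(parent: Dict, progeny: List[Dict]) -> Dict:
    """Second style: keep the parent's numbers and append a separate summary."""
    out = dict(parent)
    out["progeny_summary"] = {
        "count": len(progeny),
        **{metric: sum(p[metric] for p in progeny) for metric in METRICS},
    }
    return out


# Hypothetical figures for cluster manager 415 and its child 405.
cm_415 = {"manager": "415", "servers": 20, "cores": 480}
cm_405 = {"manager": "405", "servers": 12, "cores": 288}

merged = merged_report(cm_415, [cm_405])
appended = appended_summary_report(cm_415, [cm_405])
```

In the first style the upstream managers see one record that looks like the data of manager 415 alone; in the second they can still tell how much of the total comes from hidden progeny.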

When a parent cluster manager receives the state data that its child cluster manager sent at 325, the parent cluster manager performs its own process 300 immediately or periodically to pass along the received data to its parent (i.e., to the grandparent cluster manager of the child cluster manager) if the changes in the state data require its forwarding to its parent. In some cases, the parent cluster manager generates at least one additional aggregated datum from the received detailed data and/or the received aggregated datum before providing the data to the grandparent cluster manager. The reporting of one child cluster manager can trigger the operation of the process 300 by all ancestor cluster managers until updated state data reaches the root cluster manager. It should be noted that some embodiments do not perform the thresholding operation at 320, as the child cluster managers report all state changes upstream to their respective parent cluster managers.

FIG. 5 illustrates a process 500 for displaying data collected by the cluster managers. As described above, the use of maximum upstream-propagation level results in any cluster manager having (1) a clear view of a few levels of the resource hierarchy and (2) some data about the rest of the resource hierarchy that is hidden. Like a fractal system, the process 500 allows an administrator to zoom in to any cluster, see a few levels, and then zoom into one of those to see more information.

Through a user interface supported by a set of web servers, the administrator in some embodiments interacts with the root cluster manager (e.g., cluster manager 202) to use the zoom-in display feature provided by the process 500. In other embodiments, the administrators can directly access some or all cluster managers through their respective user interfaces that are supported by the same set or different sets of web servers. The process 500 is performed by the cluster manager with which the administrator interacts.

As shown, the process 500 starts when the cluster manager receives (at 505) identification of a set of resource clusters to examine. In some embodiments, the process 500 receives this identification as part of a zoom request that identifies one resource cluster as the resource cluster that should be the focus of the zoom operation. This identification in some embodiments is a request to review data regarding the resource cluster that is the subject of the zoom operation along with the data of this resource cluster's progeny clusters.

Next, at 510, the process sends a command to the first ancestor cluster manager of all of the resource clusters identified at 505. In the embodiments where one particular resource cluster is identified at 505 as the resource cluster that is the focus of the zoom request, the process 500 sends (at 510) the command to the cluster manager of the identified resource cluster for the zoom request. The command directs the addressed cluster manager to provide data for its resource cluster and data from its progeny cluster managers for the resource clusters that they manage. For the progeny resource clusters that have reached their maximum level (at the level of the addressed cluster manager), their data is aggregated with the data of their parent resource clusters because of the maximum upstream-propagation level criteria.

The process 500 receives (at 515) the requested data from the cluster manager to which it sent its request at 510. At 520, the process generates a report that illustrates the data collected from the cluster manager, and presents this generated report to the administrator through the user interface. The administrator can explore and navigate this report through traditional UI controls (e.g., drop-down menus, pop-up windows, etc.) to see various presentations and details of the received data.

If the administrator ends (at 525) his exploration of the data collected by the cluster manager hierarchy, the process 500 ends. Otherwise, when the administrator continues his exploration of the resource cluster hierarchy, the process returns to 505, where it receives identification of another set of resource clusters (e.g., it receives another resource cluster to zoom into in order to view its data and the data of its progeny), and then repeats the operations 510-525 for this set of resource clusters.

FIGS. 6 and 7 illustrate examples of using process 500 to explore the resource cluster hierarchy. FIG. 6 illustrates the resource cluster hierarchy for the cluster manager hierarchy illustrated in FIG. 4. In the example illustrated in FIG. 6, the administrator interacts with the root cluster manager to select resource cluster 610 as the resource cluster to zoom in on. Based on this selection, the root cluster manager collects data from cluster manager 410 for its corresponding resource cluster 610, and for the progeny resource clusters of the resource cluster 610.

As shown in FIG. 6, most of the progeny resource clusters have MUPLs of 4 and 5, but the resource cluster 605 has a MUPL of 3. This means that the cluster manager 410 would return unaggregated, granular data for the resource cluster 610 and all of its progeny clusters 611-614, except for its progeny resource clusters 605 and 615. The cluster manager 410 of the resource cluster 610 aggregates the data for the resource cluster 605 with the data for its parent resource cluster 615, because at the level of the cluster manager 410 and its corresponding resource cluster 610, the resource cluster 605 has reached its maximum upstream-propagation level.
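The aggregation behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not the implementation of any embodiment: the `Cluster` class, the scalar `value` field, and the `view` function are hypothetical names, and the sketch assumes that a cluster's MUPL counts the number of levels upstream at which its data is still reported without aggregation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Cluster:
    name: str
    value: int   # hypothetical scalar state (e.g., free capacity)
    mupl: int    # assumed: levels upstream at which the data stays granular
    children: List["Cluster"] = field(default_factory=list)


def view(focus: Cluster) -> Dict[str, int]:
    """Return the per-cluster data visible to the manager of `focus`."""
    report: Dict[str, int] = {}

    def walk(node: Cluster, depth: int, visible: Optional[str]) -> None:
        if visible is None or depth <= node.mupl:
            report[node.name] = node.value   # still granular at this level
            target = node.name
        else:
            report[visible] += node.value    # MUPL reached: fold into ancestor
            target = visible
        for child in node.children:
            walk(child, depth + 1, target)

    walk(focus, 0, None)
    return report


leaf = Cluster("c", value=3, mupl=1)
mid = Cluster("b", value=5, mupl=5, children=[leaf])
top = Cluster("a", value=10, mupl=5, children=[mid])

print(view(top))  # "c" sits two levels below "a", so it is folded into "b"
print(view(mid))  # zooming into "b" shows "c" on its own
```

In this toy hierarchy, the manager of "a" sees only an aggregate for "b" and "c", while a zoom request addressed to the manager of "b" exposes "c" separately, which mirrors the zoom-in behavior of process 500.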

FIG. 7 illustrates that after seeing the data that was collected from the resource cluster 610 and its progeny clusters in FIG. 6, the administrator performs a zoom-in operation on resource cluster 615. Based on this selection, the root cluster manager collects data from cluster manager 415 for its corresponding resource cluster 615, and for the progeny resource clusters of the resource cluster 615. To collect the data from the cluster manager 415, the root cluster manager in some embodiments directly communicates with the cluster manager 415. In other embodiments, the root cluster manager sends its request for additional data from the cluster manager 415 through the intervening cluster managers between the root cluster manager and the cluster manager 415. In these other embodiments, the cluster manager 415 would receive the zoom-in data request operation from its parent cluster manager, which in turn receives it from its parent, and so on.

As shown in FIG. 7, most of the progeny resource clusters of the resource cluster 605 have an MUPL of 3, one resource cluster 614 has an MUPL of 4, and one resource cluster has an MUPL of 2. Based on these MUPLs, the cluster manager of resource cluster 705 would aggregate the data of the resource cluster 702 with that of its parent resource cluster 704, and the cluster manager 405 of the resource cluster 605 would aggregate the data of the resource cluster 706 with that of its parent cluster 704.

Moreover, the cluster manager 415 would aggregate the data that it gets for resource cluster 704 (which includes the data for resource clusters 702 and 706) with the data from resource cluster 705 along with the data from the resource cluster 708. The cluster manager 415 would aggregate the data that it gets for resource clusters 712 and 714 with the data of their parent resource cluster 716, as both clusters 712 and 714 have an MUPL of 3, which has been reached at the level of the resource cluster 615 and the cluster manager 415.

A change in desired state can be sent down easily to all cluster managers that are visible from an ancestor cluster manager (e.g., from the top cluster manager). However, given that not all progeny cluster managers are visible to an ancestor cluster manager (e.g., to the top cluster manager), the management hierarchy of some embodiments uses novel processes to manage top-down desired state distribution in a scalable manner.

For instance, FIG. 8 illustrates that in some embodiments, desired state can be distributed as a uniform command to all the progeny cluster managers in the hierarchy, e.g., with commands such as “upgrade all Object Stores to Version 3” or “make sure any object store has at least 30% free capacity,” which might prompt some lower-level manager to move objects across various clusters to balance the system. This uniform command is expressed with objective, standard criteria that can be deciphered by all cluster managers. As shown, the root cluster manager 802 sends this command to its child cluster managers 804 and 806, which in turn send this command to their child cluster managers 808, 810 and 812, and so on.

The approach illustrated in FIG. 8 works well in hierarchical management systems that distribute policies to all cluster managers so that they have a uniform list of policies. In some embodiments, each particular cluster manager for a particular resource cluster receives, from its parent cluster manager, a set of policies for implementing state changes. The particular cluster manager in some embodiments distributes the received set of policies to its child cluster managers, which also distribute them to their progeny until all the cluster managers have received the same set of policies.

Subsequently, after receiving the set of policies, the particular cluster manager separately receives from its parent cluster manager a command and a set of policy-evaluating criteria. Like the received policies, the particular cluster manager in some embodiments distributes the received command and policy-evaluating criteria set to its child cluster managers, which also distribute them to their progeny until all the cluster managers have received the same command and policy-evaluating criteria set.

The command directs each cluster manager to implement a state change (e.g., perform a security check) when the received set of policy-evaluating criteria (e.g., a particular threshold value for available CPU cycles) satisfies a group of the received policies (e.g., a policy that allows security checks when there are more than a threshold amount of CPU cycles). Each cluster manager determines whether the received set of policy-evaluating criteria satisfies a group of one or more received policies. If so, the cluster manager processes the command to implement the state change on the resource cluster that it manages. In some embodiments, the cluster manager sends a notification to the cluster manager that sent the command either directly, or through any intervening cluster managers in the cluster manager hierarchy.
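The two-phase distribution described above (policies first, then a command with policy-evaluating criteria) can be sketched as follows. This is a minimal illustration under stated assumptions: the `ClusterManager` class, its method names, and the representation of a policy as a Python predicate over a criteria dictionary are all hypothetical, and the sketch assumes that a command is processed only when every received policy is satisfied.

```python
from typing import Callable, Dict, List, Optional

Criteria = Dict[str, float]
Policy = Callable[[Criteria], bool]   # hypothetical policy representation


class ClusterManager:
    def __init__(self, name: str,
                 children: Optional[List["ClusterManager"]] = None):
        self.name = name
        self.children = children or []
        self.policies: List[Policy] = []
        self.applied: List[str] = []   # commands this manager has processed

    def receive_policies(self, policies: List[Policy]) -> None:
        # Store the policies, then pass the same set to all child managers.
        self.policies = list(policies)
        for child in self.children:
            child.receive_policies(policies)

    def receive_command(self, command: str, criteria: Criteria) -> None:
        # Process the command only if the criteria satisfy the policies.
        if self.policies and all(p(criteria) for p in self.policies):
            self.applied.append(command)
        for child in self.children:
            child.receive_command(command, criteria)


leaf = ClusterManager("leaf")
root = ClusterManager("root", [leaf])

# Hypothetical policy: allow the state change above 20% free CPU cycles.
root.receive_policies([lambda c: c["free_cpu_pct"] > 20])
root.receive_command("security_check", {"free_cpu_pct": 35})
root.receive_command("security_check", {"free_cpu_pct": 10})
print(leaf.applied)
```

Because every manager holds the same policy list, each one can evaluate the criteria locally without consulting its ancestors.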

FIGS. 9-11 illustrate another way to distribute desired state. To distribute some desired state data, the root cluster manager 900 in some embodiments first has to collect some actual state data from its progeny clusters. Accordingly, as shown in FIG. 9, the root cluster manager 900 in some embodiments sends to its child cluster managers 902 and 904 a state request with criteria that allow the progeny cluster managers to collect state data to report back to the root cluster manager, so that the root cluster manager can decide how to implement its desired state.

For instance, in some embodiments, a root cluster manager sends a request to all its child cluster managers to find an optimal placement for a single instance of a resource, e.g., “find a cluster that has 5 PB free storage, and 40 servers with GPU.” As shown in FIG. 9, each of the root's child cluster managers sends the state request to its child cluster managers (e.g., 906, 908, 910), which in turn send it to their child cluster managers, and so on.

As shown in FIG. 10, each downstream cluster manager that can satisfy the criteria of the state request (e.g., can use the criteria as query or match attributes to identify state data to return) provides the requested state data to its parent cluster manager. For instance, in the above-described example, any downstream cluster manager that can find enough free space and an appropriate number of servers with GPUs sends an upstream report with a number that defines how “good” a fit the request would be in the free space available to it.

In some embodiments, each cluster manager that gets such a report from a downstream cluster manager discards the report of the downstream cluster manager when it already has a better solution (i.e., has a better identified state) for the request identified on its resource cluster or on a resource cluster of one of its progeny cluster managers. On the other hand, each particular progeny cluster manager sends up a report from a downstream cluster manager when the report is better than the particular progeny cluster manager's own report (if any) and other reports provided by downstream cluster managers of the particular progeny cluster manager. In some embodiments, each cluster manager sends upstream the N best solutions that it identifies, where N is an integer that is two or greater.
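The upstream filtering of placement reports can be sketched as a recursive "keep the N best" reduction. The sketch below is illustrative only: the `Manager` class, its fields, and the scoring rule (spare storage after the request is placed, with a larger value treated as better) are assumptions introduced for the example, since the text does not define how the fitness number is computed.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Manager:
    name: str
    free_pb: int        # free storage in PB on this manager's cluster
    gpu_servers: int    # servers with GPUs on this manager's cluster
    children: List["Manager"] = field(default_factory=list)


def best_reports(m: Manager, need_pb: int, need_gpu: int,
                 n: int = 2) -> List[Tuple[int, str]]:
    """Return up to n (score, cluster name) reports to send upstream."""
    candidates: List[Tuple[int, str]] = []
    if m.free_pb >= need_pb and m.gpu_servers >= need_gpu:
        # Hypothetical score: storage left over after the placement.
        candidates.append((m.free_pb - need_pb, m.name))
    for child in m.children:
        # Each child has already discarded everything but its own n best.
        candidates.extend(best_reports(child, need_pb, need_gpu, n))
    return heapq.nlargest(n, candidates)


root = Manager("root", free_pb=2, gpu_servers=10, children=[
    Manager("x", free_pb=9, gpu_servers=50),
    Manager("y", free_pb=6, gpu_servers=45),
    Manager("z", free_pb=20, gpu_servers=10),   # too few GPU servers
])

print(best_reports(root, need_pb=5, need_gpu=40, n=1))
print(best_reports(root, need_pb=5, need_gpu=40, n=2))
```

Since each level forwards at most N reports, the amount of data arriving at the root stays bounded regardless of how many clusters sit below it.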

In some embodiments, each particular progeny cluster manager sends up its report after waiting a certain duration of time to receive input from its child cluster managers (e.g., after an expiration of a timer that it sets when it sent down the state data request to its child clusters). When the state request from a parent cluster manager provides a criterion (e.g., identify hosts that have more than 50% capacity), the child cluster manager in some embodiments reports to its parent cluster manager the first state data that it identifies from its own resource cluster or from a response from one of its progeny cluster managers.

The root cluster manager in some embodiments accepts the placement identified in the first report that it receives, or the best report that it receives after a certain duration of time, or the best report that it receives after receiving responses from all of its direct child cluster managers (i.e., all the direct child cluster managers of the top cluster manager). The root cluster manager 900 processes the requested state data that it receives, and identifies a particular desired state to distribute to one or more cluster managers.

FIG. 11 illustrates that based on the requested state data that it receives, the root cluster manager 900 identifies a particular desired state (e.g., a deployment of a Pod) that it needs the cluster manager 906 to implement in the resource cluster that it manages. Through its child cluster manager 902, the root cluster manager 900 sends a command (e.g., to deploy a Pod) for the cluster manager 906 to process in order to effectuate the desired state change in the resource cluster managed by the cluster manager 906.

In other embodiments, the root cluster manager directly sends the command to the cluster manager 906. After receiving the command, the cluster manager 906 executes the command (e.g., deploys a Pod) in order to achieve the desired state in the resource cluster that it manages. Once the cluster manager 906 processes the command and achieves the desired state, the cluster manager 906 in some embodiments sends a notification of the state change to the requesting cluster manager (which in this case is the root cluster manager 900) directly, or through its intervening ancestor cluster manager(s) (which in this case is cluster manager 902).

In some embodiments, cluster managers that get the state change need to decide how to translate it for cluster managers reporting in. Consider the following example of a desired state change: “start as few as possible database servers of type X to collect detailed stats for all clusters.” This desired state change is then delegated from an ancestor cluster manager (e.g., the root node cluster manager) down to all its progeny cluster managers in order to delegate the “placement decision making” from the root cluster manager down to downstream cluster managers in the cluster manager hierarchy.

The management hierarchy of some embodiments works well with policies and templates that can be pushed down to ensure that all cluster managers have a uniform list of policies, or by pushing them up so that the top cluster manager knows which policies are supported by the downstream cluster managers.

When cluster managers are in a DAG structure, and a particular cluster manager fails or a connection between the particular cluster manager and its child manager fails, its ancestor cluster managers no longer have visibility into the progeny cluster managers of the particular cluster manager. To address such cluster manager failures (to create a high availability management hierarchy), different embodiments employ different techniques.

For instance, in some embodiments, each cluster manager has a list of possible upstream parent cluster managers so that when the cluster manager's parent CM fails, the cluster manager can identify another upstream cluster manager on the list and connect to the identified upstream manager as its new parent cluster manager. In order to keep all data correct, some embodiments require that any cluster manager that loses contact with a downstream cluster immediately remove the data it reports up in order to prevent the data from being reported twice to upstream cluster managers.

FIG. 12 illustrates a process 1200 that each non-root cluster manager performs in some embodiments to be connected at all times to one parent cluster manager. As shown, the process 1200 initially receives (at 1205) a list of possible parent CMs. Next, at 1210, the process 1200 identifies one cluster manager in the list as its parent cluster manager. In some embodiments, the received list identifies an initial parent cluster manager and/or an order for the process 1200 to use to select successive parent cluster managers. In other embodiments, the process 1200 selects a parent cluster manager based on some heuristics.

At 1215, the process 1200 establishes a parent-child relationship with the parent cluster manager identified at 1210. In some embodiments, the process establishes this relationship by communicating with the identified cluster manager to register with it as one of its child cluster managers. This registration in some embodiments establishes between the two cluster managers a tunnel for the two cluster managers to use for the communications (e.g., to exchange packets to pass desired state downstream and actual state upstream). In other embodiments, two clusters communicate through other mechanisms, e.g., through VPN (virtual private network) connections, Ethernet, Internet, etc.

After this registration, the process 1200 in some embodiments starts (at 1220) monitoring the health of the parent cluster manager and the connection link to the parent cluster manager. The health monitoring in some embodiments involves exchanging keep-alive messages with the parent cluster manager. At 1225, the process determines whether it has detected failure of the parent cluster manager or the connection link to the parent cluster manager. If not, the process returns to 1220 to continue monitoring health of the parent cluster manager and the connection link to the parent cluster manager. When the process detects failure of the parent cluster manager or the connection link to the parent cluster manager, it returns to 1210 to identify another cluster manager in the list of candidate cluster managers.
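The parent-selection step of process 1200 (operation 1210, revisited after a failure is detected at 1225) can be sketched as below. This is a simplified model, not the patented process itself: the function name `pick_parent`, the `is_healthy` callback that stands in for keep-alive monitoring, and the choice to walk the candidate list in order with wrap-around are all assumptions made for illustration.

```python
from typing import Callable, List, Optional


def pick_parent(candidates: List[str],
                is_healthy: Callable[[str], bool],
                failed: Optional[str] = None) -> Optional[str]:
    """Pick the next healthy candidate, starting after `failed` if given."""
    # Start just past the failed parent so that list order is respected.
    start = candidates.index(failed) + 1 if failed in candidates else 0
    ordered = candidates[start:] + candidates[:start]
    for cm in ordered:
        if cm != failed and is_healthy(cm):
            return cm
    return None   # no candidate is currently reachable


candidates = ["p1", "p2", "p3"]   # hypothetical list received at 1205
down = {"p1", "p2"}               # parents currently failing keep-alives

first = pick_parent(candidates, lambda cm: True)
replacement = pick_parent(candidates, lambda cm: cm not in down, failed="p1")
print(first, replacement)
```

A real child cluster manager would call such a routine from its monitoring loop and then register with the returned parent, as described for operation 1215.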

FIG. 13 illustrates a process 1300 that a parent cluster manager performs in some embodiments to manage its relationships with its child cluster managers. As shown, the process starts (at 1305) when the parent cluster manager receives a request from a cluster manager to establish a parent-child relationship. At 1310, the process 1300 establishes a parent-child relationship with the child cluster manager. In some embodiments, the process establishes this relationship by exchanging information with the child cluster manager to allow the two cluster managers to establish a communication tunnel to use for their communications (e.g., to exchange packets to pass desired state downstream (i.e., to pass commands downstream) and actual state upstream).

Next, at 1315, the process 1300 exchanges health monitoring messages with its child cluster manager(s). Health monitoring in some embodiments involves exchanging keep-alive messages with the child cluster manager(s). At 1320, the process determines whether it has detected failure of a connection with a child cluster manager. Such a failure in some embodiments can be due to the child cluster manager crashing (i.e., suffering an operation failure) or due to the failure of the connection link with the child cluster manager. If the process has not detected such a failure, it transitions to 1330, which will be described below.

When the process 1300 detects (at 1320) a failed connection to a child cluster manager, the process 1300 (at 1325) removes the child cluster manager from its list of child cluster managers, identifies new upstream state data that removes the failed child cluster manager's state data, and sends the updated state data to its parent cluster manager (i.e., the parent cluster manager of the cluster manager that is performing the process 1300). From 1325, the process transitions to 1330.

At 1330, the process determines whether it has received a new request from a new cluster manager to establish a parent-child cluster manager relationship. If so, the process returns to 1310 to establish a parent-child relationship with the new cluster manager. Otherwise, the process determines (at 1335) whether it has received notification from a child cluster manager that the connection to one of its progeny cluster managers has failed. If not, the process 1300 returns to 1315 to continue its health monitoring operations.

When the process 1300 receives such a notification, the process 1300 (at 1340) updates its state data based on the state data received with the notification, and sends its updated state data to its parent cluster manager (i.e., the parent cluster manager of the cluster manager that is performing the process 1300). After 1340, the process returns to 1315 to continue its health monitoring operations. The process 1300 continues until it has removed its last child cluster manager, at which time it terminates.
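The state-removal behavior of operations 1325 and 1340 can be sketched as follows. The model is deliberately simplified and hypothetical: state is reduced to a single integer per cluster (e.g., a host count), the class and method names are invented for the example, and upstream reports are modeled as direct method calls rather than messages over a tunnel.

```python
from typing import Dict, Optional


class ClusterManager:
    def __init__(self, name: str, own_state: int,
                 parent: Optional["ClusterManager"] = None):
        self.name = name
        self.own_state = own_state             # hypothetical scalar state
        self.parent = parent
        self.child_state: Dict[str, int] = {}  # child name -> reported state
        self.report_up()                       # register with the parent

    def total(self) -> int:
        # State reported upstream: own state plus all child reports.
        return self.own_state + sum(self.child_state.values())

    def report_up(self) -> None:
        # Send updated state to the parent, which forwards it in turn.
        if self.parent is not None:
            self.parent.child_state[self.name] = self.total()
            self.parent.report_up()

    def on_child_failed(self, child_name: str) -> None:
        # Drop the failed child's data, then notify the ancestors (1325).
        self.child_state.pop(child_name, None)
        self.report_up()


root = ClusterManager("root", own_state=1)
mid = ClusterManager("mid", own_state=2, parent=root)
leaf = ClusterManager("leaf", own_state=4, parent=mid)

before = root.total()
mid.on_child_failed("leaf")   # connection from "mid" to "leaf" is lost
after = root.total()
print(before, after)
```

Removing the lost child's contribution before it can reconnect elsewhere is what keeps the same data from being counted twice upstream.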

Other embodiments allow a cluster manager to connect to more than one upstream cluster manager and split the resources in some way between those cluster managers. For instance, FIG. 14 illustrates an embodiment in which each cluster manager reports X% of its data to each of N upstream cluster managers. In this example, the cluster manager 1400 has four parent cluster managers 1402-1408 with each parent getting 25% of the state data of the cluster manager 1400. In some embodiments, different parent cluster managers 1402-1408 can receive different amounts of the state data from the cluster manager 1400.
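One simple way to split state data evenly among several parents is a round-robin partition, sketched below. This is an assumption made for illustration: the text does not specify how the data is divided, and the function name `split_state` and the item and parent labels are hypothetical.

```python
from typing import Dict, List, Sequence


def split_state(items: Sequence[str],
                parents: Sequence[str]) -> Dict[str, List[str]]:
    """Assign each state item to exactly one parent, round-robin."""
    shares: Dict[str, List[str]] = {p: [] for p in parents}
    for i, item in enumerate(items):
        shares[parents[i % len(parents)]].append(item)
    return shares


hosts = [f"host-{i}" for i in range(8)]   # hypothetical state items
parents = ["p1", "p2", "p3", "p4"]        # four upstream cluster managers

shares = split_state(hosts, parents)
for parent, share in shares.items():
    print(parent, share)
```

With eight items and four parents, each parent receives two items (25% of the data), so losing one parent hides only that parent's share. An uneven split, which the text also allows, would need a weighted assignment instead.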

In the example illustrated in FIG. 14, when one upstream CM fails or the connection between this CM and its child CM fails, only a fraction of the resources will be temporarily invisible to the upstream cluster managers. This HA approach is combined with the above-described HA process 1200 in some embodiments to allow a child CM to connect to another parent CM when its prior parent CM, or the connection to its prior parent CM, fails.

Some embodiments also employ novel processes to avoid loops in the management hierarchy. When any cluster manager connects to an upstream cluster manager that in turn, possibly over several connections, connects to itself, a loop is formed in the management hierarchy. Some or all of the cluster managers in the hierarchy in some embodiments are configured to detect such loops by detecting that the data that they collect increases without bounds.
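The text does not say how unbounded growth is recognized, so the following is only one possible heuristic, offered as a sketch: a manager records the size of the data it collects at each reporting interval and flags a suspected loop when the size has grown at every step of a recent window and has passed a configured ceiling. The function name `looks_like_loop` and the `window` and `limit` parameters are hypothetical.

```python
from typing import Sequence


def looks_like_loop(sizes: Sequence[int], window: int = 5,
                    limit: int = 1_000_000) -> bool:
    """Flag a suspected loop from a history of collected-data sizes."""
    recent = sizes[-window:]
    if len(recent) < window:
        return False   # not enough history to judge
    # Require strict growth at every step of the window.
    growing = all(b > a for a, b in zip(recent, recent[1:]))
    return growing and recent[-1] > limit


steady = [100, 100, 100, 100, 100]
runaway = [100, 200, 400, 800, 1600]

print(looks_like_loop(steady, window=5, limit=1000))
print(looks_like_loop(runaway, window=5, limit=1000))
```

A heuristic like this can produce false positives during legitimate growth (e.g., while many new clusters are being added), so the window and ceiling would need to be tuned for a given deployment.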

FIG. 15 illustrates examples of cluster managers 1505 and resources in multiple datacenters 1500. The cluster managers 1505 in each datacenter 1500 are several servers that manage a variety of different cluster resources, such as host computers 1510, and machines (e.g., VMs, Pods, containers, etc.) 1512, software forwarding elements 1514 and service engines 1516, all executing on the host computers 1510. The cluster manager servers 1505 in some embodiments are machines that execute on host computers along with the machines 1512, while in other embodiments the servers 1505 execute on their own dedicated computers. Also, in other embodiments, the cluster managers 1505 manage other types of resource clusters, such as standalone forwarding elements (standalone routers, switches, gateways, etc.), middlebox service appliances, compute and network controllers and managers, etc.

As shown, in each datacenter 1500 the cluster managers 1505 communicate with the resource clusters (e.g., host clusters, machine clusters, SFE clusters, service engine clusters, etc.) through a datacenter network (e.g., a local area network) 1530. The datacenters are linked through one or more networks 1535 (e.g., Internet, or other private or public network). Through the network(s) 1535, one cluster manager in one datacenter can direct another cluster manager in another datacenter for downstream desired-state propagation or upstream realized-state propagation. Also, through each datacenter's network, one cluster manager in the datacenter can direct another cluster manager in the same datacenter for downstream desired-state propagation or upstream realized-state propagation.

Many of the above-described features and applications are implemented as software processes that are specified as a set of instructions recorded on a computer readable storage medium (also referred to as computer readable medium). When these instructions are executed by one or more processing unit(s) (e.g., one or more processors, cores of processors, or other processing units), they cause the processing unit(s) to perform the actions indicated in the instructions. Examples of computer readable media include, but are not limited to, CD-ROMs, flash drives, RAM chips, hard drives, EPROMs, etc. The computer readable media does not include carrier waves and electronic signals passing wirelessly or over wired connections.

In this specification, the term “software” is meant to include firmware residing in read-only memory or applications stored in magnetic storage, which can be read into memory for processing by a processor. Also, in some embodiments, multiple software inventions can be implemented as sub-parts of a larger program while remaining distinct software inventions. In some embodiments, multiple software inventions can also be implemented as separate programs. Finally, any combination of separate programs that together implement a software invention described here is within the scope of the invention. In some embodiments, the software programs, when installed to operate on one or more electronic systems, define one or more specific machine implementations that execute and perform the operations of the software programs.

FIG. 16 conceptually illustrates a computer system 1600 with which some embodiments of the invention are implemented. The computer system 1600 can be used to implement any of the above-described computers and servers. As such, it can be used to execute any of the above-described processes. This computer system includes various types of non-transitory machine readable media and interfaces for various other types of machine readable media. Computer system 1600 includes a bus 1605, processing unit(s) 1610, a system memory 1625, a read-only memory 1630, a permanent storage device 1635, input devices 1640, and output devices 1645.

The bus 1605 collectively represents all system, peripheral, and chipset buses that communicatively connect the numerous internal devices of the computer system 1600. For instance, the bus 1605 communicatively connects the processing unit(s) 1610 with the read-only memory 1630, the system memory 1625, and the permanent storage device 1635.

From these various memory units, the processing unit(s) 1610 retrieve instructions to execute and data to process in order to execute the processes of the invention. The processing unit(s) may be a single processor or a multi-core processor in different embodiments. The read-only-memory (ROM) 1630 stores static data and instructions that are needed by the processing unit(s) 1610 and other modules of the computer system. The permanent storage device 1635, on the other hand, is a read-and-write memory device. This device is a non-volatile memory unit that stores instructions and data even when the computer system 1600 is off. Some embodiments of the invention use a mass-storage device (such as a magnetic or optical disk and its corresponding disk drive) as the permanent storage device 1635.

Other embodiments use a removable storage device (such as a flash drive, etc.) as the permanent storage device. Like the permanent storage device 1635, the system memory 1625 is a read-and-write memory device. However, unlike storage device 1635, the system memory is a volatile read-and-write memory, such as a random access memory. The system memory stores some of the instructions and data that the processor needs at runtime. In some embodiments, the invention's processes are stored in the system memory 1625, the permanent storage device 1635, and/or the read-only memory 1630. From these various memory units, the processing unit(s) 1610 retrieve instructions to execute and data to process in order to execute the processes of some embodiments.

The bus 1605 also connects to the input and output devices 1640 and 1645. The input devices enable the user to communicate information and select commands to the computer system. The input devices 1640 include alphanumeric keyboards and pointing devices (also called “cursor control devices”). The output devices 1645 display images generated by the computer system. The output devices include printers and display devices, such as cathode ray tubes (CRT) or liquid crystal displays (LCD). Some embodiments include devices such as a touchscreen that function as both input and output devices.

Finally, as shown in FIG. 16, bus 1605 also couples computer system 1600 to a network 1665 through a network adapter (not shown). In this manner, the computer can be a part of a network of computers (such as a local area network (“LAN”), a wide area network (“WAN”), or an Intranet), or a network of networks, such as the Internet. Any or all components of computer system 1600 may be used in conjunction with the invention.

Some embodiments include electronic components, such as microprocessors, storage and memory that store computer program instructions in a machine-readable or computer-readable medium (alternatively referred to as computer-readable storage media, machine-readable media, or machine-readable storage media). Some examples of such computer-readable media include RAM, ROM, read-only compact discs (CD-ROM), recordable compact discs (CD-R), rewritable compact discs (CD-RW), read-only digital versatile discs (e.g., DVD-ROM, dual-layer DVD-ROM), a variety of recordable/rewritable DVDs (e.g., DVD-RAM, DVD-RW, DVD+RW, etc.), flash memory (e.g., SD cards, mini-SD cards, micro-SD cards, etc.), magnetic and/or solid state hard drives, read-only and recordable Blu-Ray® discs, ultra-density optical discs, and any other optical or magnetic media. The computer-readable media may store a computer program that is executable by at least one processing unit and includes sets of instructions for performing various operations. Examples of computer programs or computer code include machine code, such as is produced by a compiler, and files including higher-level code that are executed by a computer, an electronic component, or a microprocessor using an interpreter.

While the above discussion primarily refers to microprocessor or multi-core processors that execute software, some embodiments are performed by one or more integrated circuits, such as application specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs). In some embodiments, such integrated circuits execute instructions that are stored on the circuit itself.

As used in this specification, the terms “computer”, “server”, “processor”, and “memory” all refer to electronic or other technological devices. These terms exclude people or groups of people. For the purposes of the specification, the terms display or displaying mean displaying on an electronic device. As used in this specification, the terms “computer readable medium,” “computer readable media,” and “machine readable medium” are entirely restricted to tangible, physical objects that store information in a form that is readable by a computer. These terms exclude any wireless signals, wired download signals, and any other ephemeral or transitory signals.

While the invention has been described with reference to numerous specific details, one of ordinary skill in the art will recognize that the invention can be embodied in other specific forms without departing from the spirit of the invention. Thus, one of ordinary skill in the art would understand that the invention is not to be limited by the foregoing illustrative details, but rather is to be defined by the appended claims.

1. A method of managing resources arranged in a hierarchy in at least one datacenter, the method comprising: defining different resource managers for different resource clusters at different levels of the resource hierarchy in the datacenter; associating each of a plurality of child resource managers to have one parent resource manager, each child or parent resource manager managing a resource cluster, each child resource manager providing state data regarding the child resource manager's resource cluster to its associated parent resource manager, and configuring each particular resource manager to send notification to remove data regarding a child resource manager of the particular resource manager to the parent resource manager of the particular resource manager when the particular resource manager loses connection with the child resource manager.
 2. The method of claim 1, wherein theparticular resource cluster loses connection with the child resourcecluster when the child resource cluster has an operational failure or anetwork connectivity between the particular resource cluster and thechild resource cluster fails.
 3. The method of claim 1 furthercomprising configuring each pair of parent and child resource clustermanagers to exchange control messages to ensure that the connectionbetween the pair is maintained, said particular resource managerdetecting that the connection with the child resource manager has failedafter.
 4. The method of claim 1 further comprising associating eachparticular child resource manager of a set of child resource managers tohave at least one child resource manager of its own that manages aresource cluster that is a child resource cluster of the resourcecluster managed by the particular child resource manager, the particularchild resource manager receiving state data from each of its childresource manager regarding the resource cluster managed by that childresource manager.
 5. The method of claim 1 further comprisingconfiguring each particular resource manager of a plurality of resourcemanagers to pass along the state data provided by a first set of progenyresource managers of the particular resource manager to the parentresource manager of the particular resource manager without aggregatingthe provided state data, and to aggregate state data provided by asecond set of progeny resource managers of the particular resourcemanager to the parent resource manager as each progeny resource managerin the second set has reached a maximum upstream state propagationlevel.
6. The method of claim 1, wherein each parent resource manager is to provide commands to each of its child resource managers to direct the child resource manager to effectuate a change in state of the resource cluster managed by the child resource manager.
7. The method of claim 1 further comprising: providing each resource manager of a set of resource managers with a list of potential parent resource managers; and configuring each particular resource manager of the set of resource managers to detect that a connection with a first parent resource manager has failed, to select a second resource manager from the list, and to establish a connection with the second resource manager as the parent resource manager of the particular resource manager.
8. The method of claim 7, wherein before the connection to the first parent resource manager fails, each particular resource manager of the set of resource managers provides state data regarding the resource cluster managed by the particular resource manager to the first parent resource manager; and after the connection to the first parent resource manager fails, each particular resource manager of the set of resource managers provides state data regarding the resource cluster managed by the particular resource manager to the second parent resource manager.
9. The method of claim 1 further comprising configuring each particular resource manager to send a notification to remove data regarding a progeny resource manager of the particular resource manager to the parent resource manager of the particular resource manager when the particular resource manager receives a notification from a child resource manager that state data associated with one of its progeny resource managers should be updated due to loss of connectivity to at least one of its progeny resource managers.
10. The method of claim 1, wherein the resource clusters comprise compute clusters and network clusters, and said managers are machines executing on the datacenter, said machines being one of containers, Pods, virtual machines and standalone computers.
11. A system for managing resources arranged in a hierarchy in at least one datacenter, the system comprising: different resource managers for different resource clusters at different levels of the resource hierarchy in the datacenter; each of a plurality of child resource managers associated with one parent resource manager, each child or parent resource manager managing a resource cluster, each child resource manager providing state data regarding the child resource manager's resource cluster to its associated parent resource manager; and each particular resource manager configured to send a notification to remove data regarding a child resource manager of the particular resource manager to the parent resource manager of the particular resource manager when the particular resource manager loses connection with the child resource manager.
12. The system of claim 11, wherein the particular resource cluster loses connection with the child resource cluster when the child resource cluster has an operational failure or network connectivity between the particular resource cluster and the child resource cluster fails.
13. The system of claim 11, wherein each pair of parent and child resource cluster managers is configured to exchange control messages to ensure that the connection between the pair is maintained, said particular resource manager detecting that the connection with the child resource manager has failed after.
14. The system of claim 11, wherein each particular child resource manager of a set of child resource managers is associated with at least one child resource manager of its own that manages a resource cluster that is a child resource cluster of the resource cluster managed by the particular child resource manager, the particular child resource manager receiving state data from each of its child resource managers regarding the resource cluster managed by that child resource manager.
15. The system of claim 11, wherein each particular resource manager of a plurality of resource managers is configured to pass along the state data provided by a first set of progeny resource managers of the particular resource manager to the parent resource manager of the particular resource manager without aggregating the provided state data, and to aggregate state data provided by a second set of progeny resource managers of the particular resource manager to the parent resource manager as each progeny resource manager in the second set has reached a maximum upstream state propagation level.
16. The system of claim 11, wherein each parent resource manager is to provide commands to each of its child resource managers to direct the child resource manager to effectuate a change in state of the resource cluster managed by the child resource manager.
17. The system of claim 11, wherein each resource manager of a set of resource managers is provided with a list of potential parent resource managers; and each particular resource manager of the set of resource managers is configured to detect that a connection with a first parent resource manager has failed, to select a second resource manager from the list, and to establish a connection with the second resource manager as the parent resource manager of the particular resource manager.
18. The system of claim 17, wherein before the connection to the first parent resource manager fails, each particular resource manager of the set of resource managers provides state data regarding the resource cluster managed by the particular resource manager to the first parent resource manager; and after the connection to the first parent resource manager fails, each particular resource manager of the set of resource managers provides state data regarding the resource cluster managed by the particular resource manager to the second parent resource manager.
19. The system of claim 11, wherein each particular resource manager is configured to send a notification to remove data regarding a progeny resource manager of the particular resource manager to the parent resource manager of the particular resource manager when the particular resource manager receives a notification from a child resource manager that state data associated with one of its progeny resource managers should be updated due to loss of connectivity to at least one of its progeny resource managers.
20. The system of claim 11, wherein the resource clusters comprise compute clusters and network clusters, and said managers are machines executing on the datacenter, said machines being one of containers, Pods, virtual machines and standalone computers.
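
By way of illustration only, and without limiting the claims, the behavior recited in claims 1, 7, 8 and 9 might be sketched as follows. The sketch assumes an in-process model in which connections are plain object references; the class name `ResourceManager`, its methods, and the `cpu_free` state field are hypothetical and are not part of the described embodiments. Connection-loss detection (claims 3 and 13) is not modeled; the caller invokes `on_child_lost` or `fail_over` directly.

```python
class ResourceManager:
    """Hypothetical manager of one resource cluster in a manager tree."""

    def __init__(self, name, candidate_parents=None):
        self.name = name
        # List of potential parent managers (claim 7); assumed to be ordered.
        self.candidate_parents = list(candidate_parents or [])
        self.parent = None
        # State data held for progeny managers, keyed by manager name.
        self.progeny_state = {}
        # Direct child name -> names of managers whose state arrived via it.
        self.via_child = {}

    def attach(self, parent):
        """Establish this manager's (simulated) connection to a parent."""
        self.parent = parent
        parent.via_child.setdefault(self.name, set())

    def report_state(self, state):
        """Provide state data about this manager's own cluster to its parent."""
        if self.parent is not None:
            self.parent.receive_state(self.name, self.name, state)

    def receive_state(self, child_name, origin, state):
        """Store state from a progeny manager and pass it along upstream."""
        self.progeny_state[origin] = state
        self.via_child.setdefault(child_name, set()).add(origin)
        if self.parent is not None:
            self.parent.receive_state(self.name, origin, state)

    def on_child_lost(self, child_name):
        """Connection to a direct child was lost: drop its data, notify parent."""
        removed = self.via_child.pop(child_name, set())
        removed.add(child_name)
        self._remove(removed)

    def on_remove_notification(self, child_name, names):
        """A child reports that data for some progeny must be removed (claim 9)."""
        self.via_child.get(child_name, set()).difference_update(names)
        self._remove(names)

    def _remove(self, names):
        for n in names:
            self.progeny_state.pop(n, None)
        # Send the removal notification to this manager's own parent.
        if self.parent is not None:
            self.parent.on_remove_notification(self.name, set(names))

    def fail_over(self):
        """Pick another manager from the candidate list as the new parent."""
        failed = self.parent
        for candidate in self.candidate_parents:
            if candidate is not failed:
                self.attach(candidate)
                return candidate
        self.parent = None
        return None


# Example: root has children a and b; c starts under a and may fail over to b.
root, a, b = ResourceManager("root"), ResourceManager("a"), ResourceManager("b")
c = ResourceManager("c", candidate_parents=[a, b])
a.attach(root)
b.attach(root)
c.attach(a)
c.report_state({"cpu_free": 8})   # state reaches a, then root
a.on_child_lost("c")              # a and root both drop c's data
c.fail_over()                     # c selects b as its new parent
c.report_state({"cpu_free": 8})   # state now reaches b, then root
```

In this sketch the removal notification travels one hop at a time, so each ancestor cleans up its own copy of the lost manager's data before informing its own parent. Selecting the first candidate that is not the failed parent is an assumption; the claims only require that a second resource manager be selected from the list.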